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SPARSE MODELING OF CATEGORIAL EXPLANATORY 

VARIABLES 

By Jan Gertheiss 1 and Gerhard Tutz 
Ludwig-Maximilians- Universitdt Munich 

Shrinking methods in regression analysis are usually designed for 
metric predictors. In this article, however, shrinkage methods for cat- 
egorial predictors are proposed. As an application we consider data 
from the Munich rent standard, where, for example, urban districts 
are treated as a categorial predictor. If independent variables are 
categorial, some modifications to usual shrinking procedures are nec- 
essary. Two Li-penalty based methods for factor selection and clus- 
tering of categories are presented and investigated. The first approach 
is designed for nominal scale levels, the second one for ordinal predic- 
tors. Besides applying them to the Munich rent standard, methods 
are illustrated and compared in simulation studies. 

1. Introduction. Within the last decade regularization, and in particular 
variable selection, has been a topic of intensive research. With the introduc- 
tion of the Lasso, proposed by Tibshirani (1996), sparse modeling in the 
high-dimensional predictor case with good performance, in terms of identi- 
fication of relevant variables combined with good performance in predictive 
power, became possible. In the following many alternative regularized es- 
timators that include variable selection were proposed, among them the 
Elastic Net [Zou and Hastie (2005)], SCAD [Fan and Li (2001)], the Dantzig 
Selector [Candes and Tao (2007)] and Boosting approaches [for example, 
Biihlmann and Yu (2003)]. 

This article provides a regularized regression analysis of Munich rent stan- 
dard data. All larger German cities publish so-called rent standards for 
having guidelines available to tenants, landlords, renting advisory boards 
and experts. These rent standards are used, in particular, to determine the 
local comparative rent. For the composition of rent standards, a represen- 
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tative random sample is drawn from all relevant households, and the in- 
teresting data are determined by interviewers by means of questionnaires. 
The data analyzed come from 2053 households interviewed for the Munich 
rent standard 2003. The response is monthly rent per square meter in Euro. 
The predictors are ordered as well as unordered and binary factors. A de- 
tailed description is given in Table 1. The data can be downloaded from the 
data archive of the Department of Statistics at the University of Munich 
(http : //www . stat .uni-muenchen.de/service/datenarchiv). The direct 
link is found there. 

For example, the urban district is given as a nominal predictor with 25 
possible values. The decade of construction can be interpreted as ordinal 
with 10 levels. Usually such data are analyzed via standard linear regression 
modeling, with (for example) dummy coded categorial explanatory vari- 
ables. In the present situation such modeling is possible, since the number 
of observations (2053) is quite high. Nevertheless, from the viewpoint of in- 
terpretation, model selection is desired with the focus on reducing model 
complexity. 

In selection problems for categorical predictors as in the Munich rend 
data example, it should be distinguished between two problems: 

• Which categorical predictors should be included in the model? 

• Which categories within one categorical predictor should be distinguished? 

The latter problem is concerned with one variable and poses the question of 
which categories differ from one another with respect to the dependent vari- 
able. Or, to put it in a different way, which categories should be collapsed? 
The answer to that question depends on the scale level of the predictor, one 
should distinguish between nominal and ordered categories because of their 
differing information content. 

When investigating which of the 25 urban districts of Munich are to be 
distinguished with respect to the local rent, the number of possible combi- 
nations is huge. If only urban districts are used as categorial predictor in a 
regression model to explain the monthly rent, and districts are potentially 
fused (without further restrictions), the number of possible models — which 
just follow from different fusion results — is greater than 10 18 . In cases like 
that — that is, when the number of possible models is large — regularization 
techniques which induce sparsity are a promising approach for model se- 
lection. The extent of regularization — and hence sparsity — is typically con- 
trolled by a tuning parameter. Via choosing this parameter, the model is 
also implicitly selected. 

Most of the regularization techniques developed so far focus on the selec- 
tion of variables in the case where the effect of one variable is determined by 
one coefficient. That means coefficients are selected rather than variables. 
When all predictors are metric and a main effect model is assumed to hold, of 
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Table 1 

Explanatory variables for monthly rent per square meter 



Urban district 

Year of construction 
Number of rooms 
Quality of residential area 
Floor space (in m 2 ) 

Hot water supply 
Central heating 
Tiled bathroom 

Supplementary equipment in bathroom 
Well equipped kitchen 



Nominal, labeled by numbers 
1,...,25 

Given in ordered classes [1910, 1919], 
[1920,1929],... 

Taken as ordinal factor with levels 
1,2,. ..,6 

Ordinal, with levels "fair," 
"good," "excellent" 
Given in ordered classes (0,30), 
[30,40), [40,50), ... , [140, oo) 

Binary (yes/no) 
Binary (yes/no) 
Binary (yes/no) 
Binary (no/yes) 
Binary (no/yes) 



course selection of coefficients is equivalent to selection of predictor variables 
and model selection. This is different when categorical variables have to be 
included because then a whole group of coefficients refers to one variable. 

To be more concrete, let us first consider just one categorial predictor C € 
{0, . . . , k} and dummy coding xi = I{c=i}- Then the classical linear model is 
given as 

k 

y = a + ^ftx, + e, 

i=0 

with E(e) = and Var(e) = a 2 . If category is chosen as reference, coeffi- 
cient /3q is fixed to zero. When computing a penalized estimate, for example, 
by use of the simple Lasso [Tibshirani (1996)], the shrinkage effect depends 
on the coding scheme that is used and the choice of the reference category. 
With category zero chosen as reference, shrinkage always refers to the differ- 
ence between category i and zero. Moreover, Lasso type penalties tend to set 
some coefficients to zero. Usually this feature is seen as a great advantage 
over methods like Ridge regression, since it can be used for model/variable 
selection. Applied to dummy coded categorial predictors, however, selection 
only refers to the currently chosen reference category. In most cases of nomi- 
nal predictors, class labeling and choice of the reference category is arbitrary, 
which means that the described selection procedures are not really mean- 
ingful. In addition, the estimated model is not invariant against irrelevant 
permutations of class labels. 
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Fig. 1. Map of Munich indicating urban districts; colors correspond to estimated dummy 
coefficients if an ordinary least squares model is fitted with predictors from Table 1 and 
response monthly rent per square meter (in Euro). 

One of the few approaches that explicitly select categorical predictors 
was proposed by Yuan and Lin (2006) under the name Group Lasso. The 
approach explicitly includes or excludes groups of coefficients that refer to 
one variable. However, while the Group Lasso only attacks the problem of 
factor selection, for categorical predictor variables with many categories a 
useful strategy is to (additionally) search for clusters of categories with sim- 
ilar effects. As already described, in the presented application (among other 
things) we try to model the influence of the urban district where a person 
lives on the rent she/he has to pay. In Figure 1 a map of Munich is drawn 
with color coded urban districts. Colors correspond to dummy coefficients if 
an ordinary least squares model is fitted with (dummy coded) explanatory 
variables from Table 1 and response monthly rent per square meter (in Euro) . 
Some districts are hard to distinguish. That means it can be expected that 
not all districts do differ substantially. If an ordinary least squares model is 
fitted, however, estimated dummy coefficients (almost surely) differ. There- 
fore, the aim is to combine districts which (on average) do not substantially 
differ in terms of rent per square meter. Generally speaking, that means the 
objective is to reduce the k + 1 categories to a smaller number of categories 
which form clusters. The effect of categories within one cluster is supposed 
to be the same but responses will differ across clusters. Therefore, in a re- 
gression model corresponding dummy coefficients should be equal. Since, 
however, the number of possible clustering results — and hence the number 
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of models — tends to be very large (as already mentioned), model selection 
via regularization is quite attractive. 

Clustering or fusion of metric predictors may, for example, be obtained by 
so-called Variable Fusion [Land and Friedman (1997)] and the Fused Lasso 
proposed by Tibshirani et al. (2005). If predictors can be ordered, by putting 
a Li-penalty on differences of adjacent coefficients many of these differences 
are set to zero, yielding a piecewise constant coefficient function. Recently, 
Bondell and Reich (2009) adapted this methodology for factor selection and 
level fusion in ANOVA, to obtain dummy coefficients that are constant over 
some of the categories. The main focus of Bondell and Reich (2009), however, 
was on ANOVA typical identification of differences, not on model building 
as in our case, where prediction accuracy is also an important aspect. So 
in the following the method is reviewed and adapted to regression type 
problems. Some modifications are proposed and an approximate solution is 
presented which allows for easy computation of coefficient paths. In addition, 
the method is adapted to the modeling of ordinal predictors. 

Figure 2 shows paths of dummy coefficients for the rent data obtained by 
the method used in this article. The coefficients at value s/s max = 1 corre- 
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spond to the ordinary least squares model. It is seen that with decreasing 
tuning parameter s, categories are successively fused, that is, coefficients 
are set equal. Besides the urban district, several other covariates are given, 
among them the (categorized) year of construction. Corresponding paths of 
dummy coefficients are also shown in Figure 2. 

2. Regularization for categorical predictors. In the following we consider 
the penalized least squares criterion 

(1) QM = (y~ Xf3) T (y - Xf3) + A J((3), 

with design matrix X , coefficient vector j3 and penalty J(/3); y contains the 
observed response values. The estimate of (3 is given by 

(2) /3 = argmin{Q p (/3)}. 

P 

The decisive point is a suitable choice of penalty J(f3). We start with the case 
of one categorial explanatory variable and will distinguish between nominal 
and ordinal predictors. 

2.1. Unordered categories. If the categorial predictor has only nominal 
scale level, a modification of Variable Fusion [Land and Friedman (1997)] 
and the Fused Lasso [Tibshirani et al. (2005)] has been proposed by Bondell 
and Reich (2009) in the form of the penalty 

(3) J(P) = Y,™ij\Pi-P j \, 

i>j 

with weights Wij and /3j denoting the coefficient of dummy Xj. Since the 
ordering of xq, . . . , xt is arbitrary, not only differences fii — /3j_i (as in original 
fusion methodology), but all differences /3j — f3j are considered. Since i = is 
chosen as reference, /3o = is fixed. Therefore, in the limit case, A — > oo, all 
fa are set to zero and the categorial predictor C is excluded from the model 
since no categories are distinguished anymore. For A < oo the Lasso type 
penalty (3) sets only some differences (5i — j3j to zero, which means that 
categories are clustered. With adequately chosen weights w^, some nice 
asymptotic properties like selection and clustering consistency of (3 can be 
derived. These (adaptive) weights decisively depend on the distance of the 
ordinary least squares estimates fi\ LS ^ and (3j LS ^ . For details see Proposition 
1 in the Appendix. The issue, how to select concrete weights in the n < oo 
case, is further addressed in Sections 2.5 and 3.2. 
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2.2. Ordered categories. An interesting case are selection strategies for 
ordinal predictors, as, for example, the decade of construction from Table 1. 
Ordered categories contain more information than unordered ones, but the 
information has not been used in the penalties considered so far. Since in the 
case of ordered categories the ordering of dummy coefficients is meaningful, 
original fusion methodology can be applied, which suggests penalty 



with /3o = 0. In analogy to asymptotic properties for the unordered case, 
with adequately chosen weights Wi, similar results can be derived; see the 
Appendix for details. 

2.3. Computational issues. For the actual application of the proposed 
method a fitting algorithm is needed. For that purpose it is useful to con- 
sider the penalized minimization problem (2) as a constrained minimization 
problem. That means (y — X(3) T (y — X(3) is minimized subject to a con- 
straint. For unordered categories the constraint corresponding to penalty 



with /3o = 0. There is a one-to-one correspondence between the bound s and 
penalty parameter A in (1); cf. Bondell and Reich (2009). For estimation 
purposes we consider transformed parameters 9ij = — f3j which yield vector 
6 — (#10)#20j • • • ,&k,k-i) T - If is directly estimated (instead of one has 
to take into account that restrictions 6ij = 9{q — 9jo must hold for all i,j > 0. 
For practical estimation, parameters 0y are additionally split into positive 
and negative parts, that is, 



k 



(4) 




i=l 



(3) is 




i>j 




e 



with 





and 



i>j 



Minimization can be done by using quadratic programming methods. We 
used R 2.9.0 [R Development Core Team (2009)] and the interior point 
optimizer from add-on package kernlab [Karatzoglou et al. (2004)]. 
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The problem with quadratic programming is that the solution can only 
be computed for a single value s. To obtain a coefficient path (as in Figure 
2), the procedure needs to be applied repeatedly. Moreover, when applying 
the method to our data, we found numerical problems, especially when s 
was small. To attack these problems, we propose an approximate solution 
which can be computed using R add-on package lars [Efron et al. (2004)], 
where "approximate" means that only 9ij ~ 9{q — 9jq holds. For simplicity, 
we assume that weights Wij = 1 are chosen. But results can be generalized 
easily (see Section 2.5). For the approximation we exploit that the proposed 
estimator can be seen as the limit of a generalized Elastic Net. The original 
Elastic Net [Zou and Hastie (2005)] uses a combination of simple Ridge and 
Lasso penalties. We use a generalized form where the quadratic penalty term 
is modified. With Z so that Z6 = X/3, we define 

fl 7 , A = argmin j (y - Z6) T (y - Z0) + 7 £ (6n ~ Ojo ~ %) 2 + A £ |0 O -|] . 

6 ^ i>j>0 i>j ' 

A simple choice of Z is Z = (X\0), since 9io = fy, i = 1, . . . ,k. The first 
penalty term, which is weighted by 7, penalizes violations of restrictions 
Qij = 6io — djo- The exact solution of the optimization problem considered 
here is obtained as the limit 

6 = lim # 7 \. 

Hence, with sufficiently high 7, an acceptable approximation should be ob- 
tained. If matrix A represents restrictions 0^ = 9io — 9jo in terms of A9 = 0, 
one may define precision by 

A 7iA = (A9 ltX ) T Ae j>x . 

The lower A 7jA the better. An upper bound is given by 

x(\m\-\^ 

7 

where 9^ LS ^ denotes the least squares estimate (i.e., A = 0) where A9^ LS ^ = 
holds, and \9\ = Yli>j denotes the Li-norm of vector 9. (For a proof see 

the Appendix.) 9^ LS "> can be computed by (? 7i o if any 7 > is chosen. Not 
surprisingly, for higher A higher 7 must also be chosen to stabilize precision. 

The advantage of using the estimate # 7j a is that its whole path can be 
computed using lars [Efron et al. (2004)], since it can be formulated as a 
Lasso solution. With augmented data Z = (Z T , ^/jA T ) T and y = (y T ,0) T , 
one has 

7)A = argmin/ (y - Z9) T (y - Z0) + A^ |% | j, 

8 1 i>3 J 
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which is a Lasso type problem on data (y,Z). 
In the case of ordinal predictors the penalty is 

k 

J(/3) = ^|ft-A-i|, 
1=1 

and the corresponding optimization problem can be directly formulated as 
a simple Lasso type problem. We write 

Q P (P) = (y- Xpf(y - Xf3) + AJ(/3) = (y - X5) T (y - XS) + XJ(5), 
with X = XU~ 1 , 6 = UP, J(S) =^2i =1 \5i\, and 

/ 1 ••• 0\ 

-1 1 ••• 

u = 

o •. 

V ••• -1 l) 

Simple matrix multiplication shows that the inverse of U is given by 

°\ 


1/ 

In other words, the ordinal input is just split-coded [Walter, Feinstein and 
Wells (1987)], and ordinary Lasso estimation is applied. Split-coding means 
that dummies defined by splits at categories i = 1, . . . ,k, that is, 

2 ri, if c>i, 

1 1 0, otherwise. 

Now the model is parameterized by coefficients 5i = Pi — Pi-i, i = 1, ■ • • , k. 
Thus, transitions between category i and i — 1 are expressed by coefficient 
8i. Original dummy coefficients are obtained by back-transformation Pi = 
)] ! s=1 i5 s . By applying penalty £)t=il<S*l> not the whole ordinal predictor 
is selected, but only relevant transitions between adjacent categories. By 
contrast, Walter, Feinstein and Wells (1987) intended the use of classical 
tests for such identification of substantial "between- strata differences." 

2.4. Multiple inputs. In our application, as usual in statistical modeling, 
a set of (potential) regressors is available (see Table 1) and only the relevant 
predictors should be included into the model. In the introduction we already 
considered two predictors, the urban district where a flat is located and the 
decade of construction. For the handling of multiple categorial predictors in 
general, say, x\, . . . , x p , with levels 0, . . . , k\ for variable x\ (I = 1, . . . ,p, and 



U 



-l 



(X 
1 1 

Vi ... 
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fixed p), the presented methods can be easily generalized. The corresponding 
penalty is 



(5) J(/3) = ^J i (A), 

1=1 

with 

h 

JKA) = X>g } IAi - Ail, or j0{) = Y J wf ) \p u - /V-il, 

depending on the scale level of predictor x\. The first expression refers to 
nominal covariates, the second to ordinal ones. 

If multiple predictors are considered, clustering of categories of single 
predictors as well as selection of predictors is of interest. Penalty (5) serves 
both objectives, clustering and selection. If all dummy coefficients that be- 
long to a specific predictor are set to zero, the corresponding predictor is 
excluded from the model. Within each nominal predictor xi, there is also 
an Li-penalty on the differences to the dummy coefficient of the reference 
category. Since the latter is fixed to zero, clustering of all categories of x\ 
means that all coefficients which belong to predictor x\ are set to zero. In 
the ordinal case, this happens if all differences 6u = fin — of adjacent 

dummy coefficients of predictor xi are set to zero. 

2.5. Incorporation of weights. In many situations weig hts wf) + 1 arc 

to be preferred over the simple weights wfj = 1, for example, to obtain the 
adaptive versions described in Propositions 1 and 3 in the Appendix, or when 
predictors differ in the number of levels, as in the rent standard application 
(see Table 1). For nominal variables Bondell and Reich (2009) suggested the 
weights 



(0 , (0 



n 



(6) w ?} = (k l + l)-^l 

where nf denotes the number of observations on level i of predictor x\. 

In the adaptive version the weights contain additionally the factor — 

use of these weights (6) was motivated through standardiza- 
tion of design matrix Z from Section 2.3, in analogy to standardization of 
metric predictors. In the following these weights are also considered, but 
multiplied by 2. If predictor x\ is nominal, the factor (ki + 1) _1 is neces- 
sary to ensure that penalty Ji(Pi) in (5) is of order fcj, the number of (free) 
dummy coefficients. Without these additional weights Ji(0i) would be of or- 
der (ki + l)ki, because the penalty consists of (k\ + l)k\/2 terms if no ordinal 
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structure is assumed. By contrast, if the predictor is ordinal, the penalty is 
already of order k[. Hence, the factor 2(k[ + l)^ 1 is omitted in this case. 

In general, if weig hts wf) / 1 are included, the model just has to be pa- 
rameterized by vector 9 = WO, where W is a diagonal matrix with diagonal 
elements wf) ■ That means the (centered) design matrix needs to be multi- 
plied by W' 1 . 

2.6. Refitting procedures. The most attractive features of the methods 
described above are variable selection and clustering. However, due to pe- 
nalization, estimates are obviously biased. In the usual ANOVA case, this 
is not a problem, since the focus is on the identification of differences, and 
not on quantification. In our case — as in regression analysis in general — we 
are also interested in parameter estimation and prediction accuracy. In or- 
der to reduce the bias, refitting procedures have been proposed by several 
authors, for example, by Efron et al. (2004) under the name "Lars-OLS hy- 
brid," or by Candes and Tao (2007) as "Gauss-Dantzig Selector." In our 
setting, that means that the penalty in (1) is only used for variable selection 
and clustering. After the identification of relevant predictors and clusters, 
parameters are refitted by ordinary least squares. If variable selection and 
clustering are based on the already mentioned adaptive weights, asymptotic 
behavior is obtained which is similar to the nonrefitting case; for details, see 
the remarks on Proposition 1 in the Appendix. However, before we apply 
the refitting method to the rent data (where n < oo), its effect is also tested 
in simulation studies (see Section 3.2). 

3. Numerical experiments. Before applying the presented methodology 
to the Munich rent standard data in Section 4, the different approaches are 
tested and some characteristics are investigated in simulation studies. 

3.1. An illustrative example. In the first simulation scenario only one 
predictor and a balanced design are considered with 20 (independent) ob- 
servations in each of i = 0, ...,8 classes. In class i the response is N(/ij, 
4)-distributed, where the means form three distinct groups of categories, 
that is, /Uo = \i\ = (J,2, = = fi§, fie = fiT = fis- Figure 3 (left) shows em- 
pirical distributions as well as the true fii, which are marked by dashed lines. 
Moreover, exact and approximate paths of dummy coefficients (middle) are 
shown, where the nonadaptive version of penalty J{i3) is employed. That 
means the weighting term \fi\ LS ^ — j3^ S \~ l is omitted. Since there is only 
one predictor and the design is balanced, simple weights Wij = 1 can be used. 
The x-axis indicates s/s max , the ratio of actual and maximal s value. The 
latter results in the ordinary least squares (OLS) estimate. With decreasing 
s (or increasing penalty A), categories are successively grouped together. 
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First, classes with the same true mean are grouped as desired; for s = the 
model finally consists of the intercept only — the empirical mean of y. For 
the approximation, ^7 = 10 5 has been chosen. It is hard to see any differ- 
ence between approximate and exact solution. Indeed, for s / s max 
precision A 7 x < 10~ 17 is obtained. Also in the case of the "exact" solution, 
restrictions are just "numerically" met. In the given example precision of 
the "exact" solution is about 10 -18 (or better), which is quite close to the 
"approximate" solution. So in the following, only approximate estimates are 
used. 

In the right panel of Figure 3, the results of the adaptive version which 
uses the additional weights Wij = \^ LS ^ — /3j i5 ^| _1 are shown. Grouping is 
quite good, and compared to the nonadaptive version, bias toward zero is 
much smaller at the point of perfect grouping. 

In a second scenario, settings and data visualized in Figure 3 (left) are 
considered again, but now it is assumed that class labels have an ordinal 
structure. Hence, penalty (4) is employed. Resulting paths of dummy co- 
efficients are plotted in Figure 4. Even for the nonadaptive version (left), 
grouping is quite good. Moreover, before optimal grouping is reached, bias 
toward zero seems to be quite low. Of course, assuming an ordinal class 



Emp. within-Class Distributions Exact Solution Approximate Solution Adaptive Version 




1 2 3 4 5 6 7 8 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 

level s/s max s/s max s/s ma , 



Fig. 3. Empirical within-class distributions (left), exact and approximate coefficient 
paths (middle), as well as results of the adaptive version (right); constant a is marked by 
the dashed line. 
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structure, which is actually given because all categories with truly equal 
coefficients are groups of neighbors, makes the estimation problem easier. 

3.2. Comparison of methods. For the comparison of different methods 
a setting with 8 predictors is considered — 4 nominal and 4 ordinal factors. 
For both types of variables we use two factors with 8 categories and two 
with 4, of which in each case only one is relevant. The true nonzero dummy 
coefficient vectors are (0,l,l,l,l,-2,-2) T and (0,2,2) T for the nominal 
predictors, and (0, 1, 1, 2, 2, 4, 4) T and (0, — 2,— 2) T for the ordinal predic- 
tors (constant a = 1). A training data set with n = 500 (independent) ob- 
servations is generated according to the classical linear model with stan- 
dard normal error e. The vectors of marginal a priori class probabilities are 
(0.1, 0.1, 0.2, 0.05, 0.2, 0.1, 0.2, 0.05) T and (0.1, 0.4, 0.2, 0.3) T for 8-level and 4- 
level factors, respectively. The coefficient vector is estimated by the proposed 
method, using adaptive as well as nonadaptive weights. In addition, the ef- 
fect of taking into account marginal class frequencies nf is investigated, 
which means we check what happens if ((n^ + n*- ) /n) 1 / 2 is omitted in (6). 
Moreover, refitting is tested (as already mentioned), that is, the penalization 
is only used for variable selection and clustering. After the identification of 




Fig. 4. Paths of dummy coefficients for data as in Figure 3, but assuming an ordinal 
class structure, nonadaptive (left) and adaptive (right) version; constant a is marked by 
the dashed line. 
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Table 2 

Definition of labels used in Figures 5 and 6 

adapt Adaptive version, i.e., weighting terms |/3f LS ' — /?j LS ^| _1 are used 
stdrd Standard (nonadaptive) version, i.e., terms 

\$\ LS) -Pf^]' 1 are omitted 
n(ij) Marginal class frequencies are taken into account, 

i.e., ((nf + n^')/n) 1//2 are used in (6) 
rf Refitting was performed 



relevant predictors and clusters, parameters are refitted by ordinary least 
squares. 

For the determination of the right penalty A, resp. s value, we use 5- 
fold cross-validation. Of course, any information criterion like AIC or BIC 
could also be employed. For the latter some measure of model-complexity is 
needed. In analogy to the Fused Lasso [Tibshirani et al. (2005)], the degrees 
of freedom of a model can be estimated by 

df=i+j>*, 

i=i 

where k\ denotes the number of unique nonzero dummy coefficients of pre- 
dictor x\ and the 1 accounts for the intercept. 

After estimation of coefficient vector j3, the result is compared to the 
true parameters. The MSE is computed, as well as False Positive and False 
Negative Rates (FPR/FNR) concerning variable selection and clustering. As 
far as variable selection is concerned, "false positive" means that any dummy 
coefficient of a pure noise factor is set to nonzero; if clustering is considered, 
it means that a difference within a nonnoise factor which is truly zero is set 
to nonzero. By contrast, "false negative" means that all dummy coefficients 
of a truly relevant factor are set to zero, or that a truly nonzero difference is 
set to zero, respectively. Figure 5 shows the results for 100 simulation runs; 
labels are defined in Table 2. 

In addition to the MSE and FPR/FNR, an independent test set of 1000 
observations is generated and prediction accuracies are reported in terms of 
the mean squared error of prediction. For comparison the performance of 
the ordinary least squares (OLS) estimate is also given. MSE and prediction 
accuracy are shown as boxplots to give an idea of variability; FPR (dark 
gray) and FNR (light-colored) are averaged over all simulation runs. It is 
seen that all methods are superior to the OLS. Concerning FPR and FNR, 
differences between pure adaptive/nonadaptive approaches and refitting are 
caused by the fact that not necessarily the same models are selected, because 
in cross-validation already refitted coefficients are used. 
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Fig. 5. Evaluation of adaptive and nonadaptive (standard) as well as refitting (rf) ap- 
proaches, taking into account class sizes (rii, rij) or not, for comparison the results for the 
ordinary least squares (ols) estimator are also given; considered are the mean squared error 
of parameter estimate, prediction accuracy and false positive/negative rates (FPR/FNR) 
concerning variable selection and identification of relevant differences (i.e., clustering) of 
dummy coefficients. 



As already illustrated by Bondell and Reich (2009) and supported by 
Propositions 1 and 3 in the Appendix, selection and grouping character- 
istics of the adaptive version are quite good — at least compared with the 
standard approach. Also, accuracies of parameter estimates and prediction 
of the adaptive version are very high in our simulation study. Via refitting, 
they can only be slightly improved. In the case of standard weights, the 
improvement is much more distinct. However, the most important effect of 
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refitting is on variable selection and clustering — in both the adaptive and 
the nonadaptive case. It can be seen that via refitting error rates are enor- 
mously diminished — concerning false variable selection as well as clustering. 
This finding can be explained by the bias which is caused by shrinking. If 
tuning parameters are determined via cross-validation (as done here), with 
refitting the chosen penalty parameter A may be higher than without, be- 
cause in the latter higher penalty directly results in a higher bias 
which may deteriorate prediction accuracy on the test fold. Since in the 
case of refitting the penalty is only used for selection purposes, a higher 
value does not necessarily cause higher coefficient shrinkage and bias. Ap- 
parently, however, in many of our simulated higher penalty would 
have been necessary to obtain accurate variable selection and grouping. 

In a modified scenario further noise variables are included, 4 nominal and 
4 ordinal, each with 6 levels and constant marginal a priori class probabili- 
ties. Qualitatively, results (shown in Figure 6) are similar to those obtained 
before. However, since the number of independent variables has been con- 
siderably increased, the performance of the ordinary least squares estimates 
is even worse than before. This also explains why (in the adaptive case) the 
MSE and prediction accuracies cannot be really improved by OLS refitting, 
and why in the case of refitting variability is higher. Nevertheless, variable 
selection and clustering results are still distinctly better if refitting is done. 

As an overall result, it can be stated that, given a regression problem, 
refitting has the potential to distinctly improve selection and clustering re- 
sults in the n < oo case, while providing accurate parameter estimates (if 
n is not too small compared to p). Therefore, it can be assumed to be a 
suitable approach for our regression analysis. Moreover, taking into account 
marginal class frequencies seems to (slightly) improve estimation results. 

4. Regularized analysis of Munich rent standard data. For the estima- 
tion of regression coefficients with predictors from Table 1, we consider the 
approaches which performed best in the previous section; more concrete, 
both the adaptive as well as the standard (nonadaptive) version remain 
candidates, but each with refitting only and taking marginal class frequen- 
cies into account. In the following we first analyze the data and then evaluate 
the performance of the approach (using the rent data) comparing it to ordi- 
nary least squares and Group Lasso estimates, which do not provide variable 
selection and/or clustering of categories. 

4.1. Data analysis. In the considered application more than 2000 obser- 
vations are available for the estimation of 58 regression parameters. Thus, 
OLS estimation works, and (in the light of the simulation study before) it is 
to be expected that refitting distinctly improves estimation accuracy as well 
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Fig. 6. Evaluation of different approaches in the presence of many noise variables: adap- 
tive and nonadaptive (standard) as well as refitting (rf), taking into account class sizes 
(rii, Tij) or not, for comparison also the ordinary least sguares (ols) estimator; consid- 
ered are the mean squared error of parameter estimate, prediction accuracy and false posi- 
tive/negative rates (FPR/FNR) concerning variable selection and identification of relevant 
differences (i.e., clustering) of dummy coefficients. 



as variable selection and clustering performance of the proposed penalized 
approach. 

Figure 7 shows the (10-fold) cross-validation score as a function of s/s max , 
for the refitted model with nonadaptive (dashed black) as well as adaptive 
weights (solid red). It is seen that penalized estimates, in particular, refit- 
ting with adaptive weights, may improve the ordinary least squares estimate 
(i.e., s/s max = 1) in terms of prediction accuracy. It is not surprising that 
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Fig. 7. Cross-validation score as a function of s/s max if refitting with standard (dashed 
black) or adaptive (solid red) weights is used for the analysis of Munich rent standard data. 

adaptive weights show better performance than nonadaptive ones, since sam- 
ple size is high, which means that ordinary least squares estimates are quite 
stable, and the latter decisively influence adaptive weights. So we choose 
adaptive weights at cross-validation score minimizing s/s max = 0.61 (marked 
by dotted line in Figure 7) . The estimated regression coefficients are given in 
Table 3. There is no predictor which is completely excluded from the model. 
However, some categories of nominal and ordinal predictors are clustered, 
for example, houses constructed in the 1930s and 1940s, or urban districts 
14, 16, 22 and 24. It is interesting that rents of houses constructed shortly 
before the Second World War and those constructed within or shortly after 
the war do not substantially differ. 

The biggest cluster, which contains 8 categories, is formed within the 25 
districts. A map of Munich with color coded clusters (Figure 8) illustrates 
the 10 found clusters. The map has been drawn using functions from R add- 
on package BayesX [Kneib et al. (2009)]. The most expensive district is the 
city center. After inspection of OLS estimates (e.g., in Figure 2), it could be 
expected that rather cheap districts 14 and 24 are fused. It was not clear, 
however, if they are additionally collapsed with any other districts, and if so, 
whether fused with {16, 22} or {11, 23}. Based on our regularized analysis, it 
can now be stated with good reason that rents in districts 14, 16, 22 and 24 
are comparatively low and do not substantially differ, which is in agreement 
with judgements from experts and feelings of laymen, because Munich's 
deprived areas are primarily located in these (nonadjacent) districts. The 
cluster that contains district 12, however, partly contradicts experiences of 
experts and tenants. The problem is that this district is very large and 
reaches from the city center to the outskirts in the north. So very expensive 
flats which are close to the city center are put together with cheaper ones 
on the outskirts. But, on average, rents are rather high in this district, 
which causes it to be clustered with other expensive but more homogeneous 
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Table 3 




Estimated regression coefficients for Munich rent standard data using i 


idaptive weights 


with refitting, and (cross- 


validation score minimizing) s/s m ax 


= 0.61 


Predictor 


Label 


Coefficient 


1 i if ^ircnnt 

llllcl Lcpi 




iz.oy i 


Urban district 


14, 16, 22, 24 


-1.931 




11, 23 


-1.719 




7 


-1.622 




R 10 1 ^ 17 10 90 91 9^ 

o, iu, io, ij, iy, zu, z±, zo 


-1.361 




6 


i nfii 




9 


— U.yoU 




13 


-0.886 




2, 4, 5, 12, 18 


-0.671 






-0.403 


Year of construction 


1920s 


-1.244 




1930s, 1940s 


-0.953 




1950s 


-0.322 




1960s 


0.073 




1970s 


0.325 




1980s 


1.121 




i none onnrio 

iyyus, zuuus 


1.624 


Number of rooms 


4, 5, 6 


-0.502 




3 


-0.180 




2 


0.000 


Quality of residential area 


good 


U.OI o 




excellent 


1.444 


Floor space (m 2 ) 


[140, oo) 


-4.710 




[90,100), [100,110), [110,120), 






[120,130), [130,140) 


-3.688 




[60,70), [70,80), [80,90) 


-3.443 




[50,60) 


-3.177 




[40, 50) 


-2.838 




[30,40) 


-1.733 


Hot water supply 


no 


-2.001 


Central heating 


no 


-1.319 


Tiled bathroom 


no 


-0.562 


Suppl. equipment in bathroom 


yes 


0.506 


Well equipped kitchen 


yes 


1.207 



areas. In an ordinary least squares model, district 12 is even identified as 
belonging to the three most expensive districts (see also Figures 1 and 2). 
Penalized estimation ranks it only among the top seven. But it should be 
noted that in the final regression model there is also an ordinal predictor 
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-1.9307 



Fig. 8. Map of Munich indicating clusters of urban districts; colors correspond to esti- 
mated dummy coefficients from Table 3. 

included which indicates the quality of the residential area and allows for 
further discrimination between flats which are located in the same district. 

Not surprisingly, rent per square meter goes down if the number of rooms 
increases. Between four, five or more rooms, however, no relevant differences 
are identified. Flats with two rooms are fused with the reference category, 
since the corresponding dummy coefficient is set to zero. The fact that no 
differences between flats with one and two rooms are found is caused by the 
inclusion of floor space into the model. Existing differences are obviously 
modeled via the variable which directly measures the flat's size, with the 
effect that for larger flats the rent per square meter is lower. Starting with 
small apartments, the decrease of rents is quite apparent (between ca. 20 
and 60 m 2 ), then it is much slower. Between 90 and 140 m 2 , for example, 
no differences are identified with respect to rent per square meter. The fact 
that the covariate which indicates the number of rooms is not completely 
excluded from the model, although the flat's floor space is also considered, 
shows that there are dependencies between rent and the number of rooms 
which do not only refer to the flat's size. If covariate floor space is held 
constant, but the number of rooms is increased, rents tend to go down. 

All in all, the selected model has 32 degrees of freedom, that is, 32 unique 
nonzero coefficients (including the intercept), which means that the com- 
plexity of the unrestricted model (58 df) is reduced by about 45%. 

4.2. Using spatial information. A possible alternative to treating the ur- 
ban district as a nominal predictor is to include geographical information. 



SPARSE MODELING OF CATEGORIAL EXPLANATORY VARIABLES 



21 




Fig. 9. Map of Munich indicating clusters of urban districts, if just differences between 
dummy coefficients of neighboring districts are penalized. 

One can use distance measures or neighborhood effects when looking for 
clusters. A simple approach we used is to penalize — in analogy to the or- 
dinal predictor case — only differences between dummy coefficients of neigh- 
boring districts. In Figure 9 a map of Munich is shown which results from 
such neighborhood penalization. One problem with this map is that district 
12 (which reaches from the center to the north) is now fused with three 
expensive adjacent districts in the center. We also fitted a more advanced 
neighborhood weighting scheme, which uses the length of the boundary be- 
tween the corresponding districts as weights. Then the difference between 
district 12 and a neighboring (and cheap) district in the north would get 
some higher weight. However, even that modification does not solve the 
second problem linked with that kind of spatial information based regu- 
larization: Two nonadjacent districts will not be fused if they are not also 
fused with a whole set of districts building a chain that connects them. In 
Figure 9 the two light-colored districts in the west and southeast (22 and 
16) seem quite similar. In contrast to Figure 8 and Table 3, however, they 
are not fused. The corresponding difference of dummy coefficients is about 
0.007 — close to, but not exactly zero. Generally speaking, districts which 
are not neighbors may also be quite similar. Therefore, fusion of such dis- 
tricts should be possible, too. Hence, we prefer an approach like our initial 
modeling where all pairwise differences of districts' dummy coefficients have 
been penalized. 

A more general procedure to include spatial information is to incorporate 
this information into the weights Wij in (3). For that purpose weights may be 
additionally multiplied by factors Qj, where Qj contains spatial information. 
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As long as < Qj < oo for all consistency as given in Proposition 1 is 
not affected, and all pairwise differences are still penalized as desired in our 
application. Factor Qj can, for example, be defined as a decreasing function 
of the distance between districts i and j. A special case of such an approach 
is to penalize only differences of neighboring districts as already done before. 
This, however, does not guaranty Qj > for all and did not produce good 
results in our application (as shown above). Furthermore, it seems sensible 
to assume that differences (concerning rents) between the city center and 
the outskirts tend to be larger than differences between outskirts in the west 
and the east of a city. So for defining we may use the information whether 
a district is rather central or peripheral. If q denotes the distance of (the 
center of) district i to the city center (in km), we define 



with a fixed kernel K and bandwidth h. For K we use the Epanechnikov 
kernel, and h = 15 (km), which is roughly the radius of the smallest circle 
around the city center which contains the whole city of Munich. Incorporat- 
ing spatial information this way, however, yields exactly the same clustering 
as already given in Table 3, where urban districts have just been treated as 
a nominal predictor. So we can keep interpretations given above, and will 
just use the districts' categorial character in the following. The finding that 
results do not change if dj are included is obviously due to the fact that 
weights are decisively influenced by the ols terms (see Proposition 1 in the 
Appendix). 

4.3. Evaluation of prediction accuracies and sparsity. The proposed meth- 
ods provide clustering of categories, which results in a sparser model and 
facilitates interpretation in the considered application. In order to evaluate 
their actual prediction accuracies, we perform repeated random splitting of 
the data into training and test sets. That means coefficients are estimated 
on the training data (including determination of tuning parameters and 
weights), and then used to predict the test data. As test set size we choose 
100, and the procedure is independently repeated 100 times. Results are 
shown in Figure 10. Performance is measured in terms of the mean squared 
error of prediction (MSEP). We investigate the refitted adaptive as well as 
the nonadaptive version of the presented regularization technique. For com- 
parison, we also give prediction accuracies for the (most complex) ordinary 
least squares model, and for Group Lasso estimates as proposed by Yuan 
and Lin (2006) or Meier, Van de Geer and Biihlmann (2008). In the case 
of ordinal predictors, the usual within groups simple ridge penalty is re- 
placed by a difference penalty as proposed in Gertheiss and Tutz (2009) and 
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Fig. 10. Prediction performance of the refitted adaptive (adapt) as well as nonadaptive 
(stdrd) sparse modeling of Munich rent standard data, the Group Lasso (grpl) and ordinary 
least squares (ols) fitting; all results (top left) as well as selected pairwise comparisons. 



Gertheiss et al. (2009). For practical estimation of Group Lasso estimates R 
add-on package grplasso [Meier (2007)] was used. 

The first plot (top left) in Figure 10 shows boxplots of the observed 
MSEPs for all four methods. It is seen that all methods perform almost 
equally. This finding is confirmed by pairwise comparisons. Since in each it- 
eration MSEPs of different methods are observed on the same test data, 
we report pairwise differences of corresponding MSEPs. Boxplots which 
tend to be below zero indicate superior performance of the method which 
is quoted first — and vice versa. It just seems that the proposed adaptive 
version is slightly superior to the ordinary least squares estimate. Between 
the different penalization techniques — the presented sparse modeling (adap- 
tive/nonadaptive) and the Group Lasso — there can hardly be observed any 
difference concerning prediction accuracy on the rent standard data. 

It is a quite positive result, however, that prediction accuracies of the con- 
sidered methods are almost identical, because sparsity is the great advantage 
of the modeling which has been applied above to analyze the data. While the 
ordinary least squares model has 58 degrees of freedom, the (refitted adap- 
tive) model which has been chosen on the basis of all data just has 32 df (see 
Table 3). In Figure 11 we now show kernel density estimates of the model 
complexities observed during random splitting of the data. It is seen that 
the adaptive models (solid red) tend to have less degrees of freedom than 
the nonadaptive version (dashed black). But also the latter is far away from 
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Fig. 11. Kernel density estimates of chosen degrees of freedom for the adaptive (solid 
red) and nonadaptive (dashed black) model after repeated random splitting of Munich rent 
standard data. 

the 58 df of the OLS model. Furthermore, the Group Lasso can only perform 
variable selection, but no clustering of single categories. However, in each 
of the considered random splits all factors were selected (not shown) , which 
means that none of the dummy coefficients estimated by the Group Lasso 
were set to zero. Hence, with the (via cross-validation) chosen tuning param- 
eters, the effect of the Group Lasso penalty was just shrinkage/smoothing of 
groups of dummy coefficients, but no variable selection. That means in the 
case of the analyzed rent standard data the Group Lasso does not result in a 
sparser parametrization than the OLS model. In summary, on the rent data 
our model can be expected to be as accurate as competing models, while 
complexity is distinctly reduced and interpretability is increased. 

5. Summary and discussion. We showed how Li-penalization of dummy 
coefficients can be employed for sparse modeling of categorial explanatory 
variables in multiple linear regression. Depending on the scale level of the 
categorial predictor, two types of penalties were investigated. Given just 
nominal covariates, all pairwise differences of dummy coefficients belonging 
to the same predictor are penalized. If the variable has ordinal scale level dif- 
ferences of adjacent coefficients are considered. L\ -penalization causes that 
certain differences are set to zero. The interpretation is clustering of cate- 
gories concerning their influence on the response. In the analysis of the rent 
standard data this meant that, for example, certain urban districts were 
identified where rents do not substantially differ on average. If all dummy 
coefficients which belong to a certain predictor are set to zero, the corre- 
sponding covariate is completely removed from the model. 

In particular, it was shown that the usually applied (and accurate) ordi- 
nary least squares fitting of rent standard data can be improved if categorial 
predictors are adequately penalized. Such improvement is primarily in terms 
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of interpretability and model complexity. Via repeated random splitting of 
the data at hand, it could be shown that model complexity could be re- 
duced by about 40-50% while prediction accuracies did not deteriorate. As 
simulation studies showed, in cases of smaller sample sizes estimation and 
prediction accuracies can also be distinctly improved via the presented Li- 
difference-penalization. 

An alternative approach would be to apply clustering methods on ols es- 
timates, which may give similar results for the considered rent data (see 
Figure 2, though it is not clear, for example, in which way districts {14,24} 
should be fused with other districts). However, this would be a two-step 
procedure and hence less elegant than a penalty based regularization tech- 
nique. Moreover, in case of smaller sample sizes it would severely suffer from 
instability of ols estimates. 

Though penalization with adaptive weights has some nice asymptotic 
properties, simulation studies also showed that in the case of finite n par- 
ticularly variable selection and clustering performance can even be further 
improved via ordinary least squares refitting of fused categories. A general- 
ization of refitting is the so-called relaxed Lasso [Meinshausen (2007)], which 
puts a second penalty on (dummy) coefficients of fused categories. The dis- 
advantage of relaxation is the second tuning parameter. In the case of the 
Munich rent standard, sample sizes are so high that accurate (ordinary) 
least squares estimation is possible, which means that the second penalty 
parameter can be omitted. 

In the case of ordinal predictors, computation of the proposed estimator 
is easily carried out by the lars algorithm [Efron et al. (2004)], since the 
estimate is just an ordinary Lasso solution, if independent variables are split- 
coded. If predictors are nominal, we showed how procedures designed for 
ordinary Lasso problems can also be used to compute approximate coefficient 
paths. 

APPENDIX 

Asymptotic properties for the unordered case. Let 9 = (9\o, 620, ■ ■ ■ , 
9k,k-i) T denote the vector of pairwise differences 9ij = fa — /3j. Furthermore, 
let C = ■ fal 7^ /3j ,i> j} denote the set of indices i > j corresponding to 

differences of (true) dummy coefficients /3* which are truly nonzero, and C n 
denote the set corresponding to those difference which are estimated to be 
nonzero with sample size n, and based on estimate (3 from (2) with penalty 
(3). If 9 C denotes the true vector of pairwise differences included in C, 9c de- 

"(LS) 

notes the corresponding estimate based on /3, and /3 4 - the ordinary least 
squares estimate of fa, then a slightly modified version of Theorem 1 in 
Bondell and Reich (2009) holds: 
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Proposition 1. Suppose A = A n with \ n j 'y/n —> and X n — > oo, and all 
class-wise sample sizes ni satisfy ni/n — > Ci, where < Cj < 1. Then weights 

Wij = (pij(n)\^ LS ^ — /3 ? - I " S '^| -1 , with <fiij(n) — > q^ (0 < qij < oo) Vz,j, ensure 
that: 

(a) v^(<? c -0£)^ d iV(O,E) ; 

(b) limn^oo P(C n = C) = l. 

Remarks. The proof closely follows Zou (2006) and Bondell and Reich 
(2009), and is given below. The main differences to Bondell and Reich (2009) 
are that a concrete form of the dependence on sample size, specified in 
(f>ij(n), is not yet fixed, and that A n is determined by \ n / y/n — > and A n — >• 
oo. The latter is needed for the proof of asymptotic normality, as given in 
Zou (2006). Bondell and Reich (2009) used A n = O p (\/n), which also allows 
A n = and therefore cannot yield lim n _ i , 00 P{C n = C) = 1. 4>ij{n) only needs 
to converge toward a positive finite value (denoted by qij). Note that the 
covariance matrix T, of the asymptotic normal distribution is singular due 
to linear dependencies of pairwise differences; cf. Bondell and Reich (2009). 
The concrete form of £ results from the asymptotic marginal distribution 
of a set of nonredundant truly nonzero differences as specified in the proof. 

Due to the (additive) form of the penalty (5), theoretic results from above 
directly generalize to the case of multiple categorial inputs, given the number 
p of predictors and the number k\ of levels of each predictor X[ are fixed. 

Simple consistency linin^oo P(\\/3 — /3*|| 2 > e) = for all e > is also 
reached if A is fixed and Wij = 4>ij(n), with 4>ij(n) — > q^ (0 < q^ < oo) Vi, j, 
is chosen. This behavior is formally described in Proposition 2. 

If adaptive weights are used and refitting is applied after the identification 
of clusters and relevant variables, asymptotic behavior is obtained which 
is comparable to Proposition 1. Since clustering and variable selection are 
directly based on the penalty with adaptive weights, part (b) of Proposition 
1 is still valid. Asymptotic normality results from asymptotic normality of 
the ordinary least squares refit. 

Proof of Proposition 1. We first show asymptotic normality, which 
closely follows Zou (2006) and Bondell and Reich (2009). Coefficient vector f3 
is represented by u = ^/n((3 — /?*), that is, j3 = f3* + u/y/n, where f3* denotes 
the true coefficient vector. Then we also have /3 = j3* + u/y/n, with 

u = argmin 1 J/ n ('u), 

u 

where 

».(.) -(v-x(r + ^)) T (y-x(?" + £)) + ^M, 
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As given in Zou (2006), we will consider the limit behavior of (X n /Jn)J(u). 
If p* / 0, then 
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Since by assumption 4>ij(n) — > qij (0 < qij < oo) and X n / Jn — > 0, by Slutsky's 
theorem, we have 
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This also makes clear that assumption A n = O p {y/n) is not enough. If (3* = 



or f3* = f3*, however, 



ft + 



ft + 



U: 



Iff] 



\ft-ft 



and 



respectively. 



Moreover, if (3* = or f3\ = (3*-, due to -y/n-consistency of the ordinary least 
squares estimate (which is ensured by condition n%jn — > < q < 1 Vi), 



lim P(V^\P L } \<^n /2 ) 
n— >oo 

lim P{^ S) -pf S) \<Kl 2 ) 

n. — inn J 



respectively, 



since A n — > oo by assumption. Hence 

A n 4>io(n) 
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Ui — Uj 
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\ft 
-ft\ 



oo, 



oo, 



or 



if Ui 0, resp. Ui ^ Uj. That means if for any i,j > with /?* = ft or 

(3* = 0, U{ 7^ or Ui ^ 0, respectively, then (\ n /-*/n)J(u) —> p oo. The rest 
of the proof of part (a) is almost identical to Bondell and Reich (2009). Let 
X* denote the design matrix corresponding to the correct structure, that 
is, columns of dummy variables with equal coefficients are added and col- 
lapsed, and columns corresponding to zero coefficients are removed. Since 
Vi rii/n — > Ci (0 < q < 1), 



-X* T X* 
n 



C > and 



e r X* 



ir 



^dW, with w ~ N(0,a 2 C). 



Let 9c c denote the vector of differences B%j = f3i — ft which are truly zero, 
that is, not from C, and uc c the subset of entries of 9q c which are part of 
u. By contrast, uq denotes the subset of 6c which are in u. As given in Zou 
(2006), by Slutsky's theorem, V n (u) — ^ V(u) for every u, where 



V(u) 



UqCuc — 2uqW, 



oo, 



if d C c = 0, 
otherwise. 



Since V n (u) is convex and the unique minimum of V(u) is (C 1 w,0) T , we 
have [cf. Zou (2006); Bondell and Reich (2009)] 

uc C~ w and uq c — >-<2 0- 
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Hence, uc —>d N(0, c 2 C~ 1 ). By changing the reference category, that is, chang- 
ing the subset of entries of 9 which are part of u, asymptotic normality can 
be proven for all pairwise differences in 9q- 

To show the consistency part, we first note that lim n _ >00 P((z,_7) 6 C n ) = 
1, if £ C, follows from part (a). We will now show that if 
lim n _ > . tX )P((z, j) G C n ) = 0. The proof is a modified version of the one given 
by Bondell and Reich (2009). Let B n denote the (nonempty) set of pairs of 
indices i > j which are in C n but not in C. Then we may choose reference 
category such that ft q = ft q — fto > is the largest difference corresponding 
to indices from B n . Moreover, we may order categories such that fti< ■■ ■ < 
ftz < < ft z +i < " " " < ftk- That means estimate ft from (2) with penalty (3) 
is equivalent to 

ft = argmin {(y - Xftf(y - Xft) + X n J(ft)}, 

{Pl<-<P*<0<Pz+l<-<Pk} 

with 

J(ft) = J2 &j(n) % h 

i>j;i,j^0 \Pi Pj I 

i>z+l \P% I i<z \Pi I 

Since ft q ^ is assumed, at the solution ft this optimization criterion is 
differentiable with respect to ft q . We may consider this derivative in a neigh- 
borhood of the solution where coefficients which are set equal remain equal. 
That means terms corresponding to pairs of indices which are not in C n can 
be omitted, since they will vanish in J(ft). If x q denotes the qth column of 
design matrix X , due to differentiability, estimate ft must satisfy 

Q' q {ft) 2x T q {y-Xft)_ 



jn v n 
with 



and 



j<q;(q,j)£C \Pi Pj I i>q;(i,q)eC \Pi Pi 



Jn z — ' 

j<q;{q,j)£Bn 



\ft? S) -ftf S \ 



If /3* denotes the true coefficient vector, Q'Jft)/JH can be written as 



(/3) = 2xT(y-Xft) = 2x T q Xj^i{ft*-ft) 2x£e 
n \fn n Jn 
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From part (a) and applying Slutsky's theorem, we know that 2x^Xy/n(ft — 

ft)/n has some asymptotic normal distribution with mean zero, and 2xTe/ ' \fn 
as well (by assumption, and applying the central limit theorem); cf. Zou 
(2006). Hence, for any e > 0, we have 

lim P(Q' q 0)/^<X 1 n /i -e) = l. 

Since X n /^/n^■ 0, we also know 3e > such that lim n _ >00 P(|^4 n | < e) = 1. 
By assumption, A n — > oo; due to -v/n-consistency of the ordinary least squares 
estimate, we know that 

lim P(v^|/3f 5 )-/3 7 (i5) |<Ay 2 ) = l, 
if (q,j) G &n- Hence, 

lim P(D n > Xl/ 4 ) = 1. 

n— Yoo 

As a consequence, 

lim P(Q' (4)/Vn = A n + D n ) = 0. 

That means if ^ C, also 

lim P((i,j)€C n ) = 0. □ 

n— >oo 

Proposition 2. Suppose < A < oo /ias 6een fixed, and all class-wise 
sample sizes n, satisfy rii/n — > Ci, where < c% < 1. T/ien weights Wij = 
4>ij(n), with 4>ij(n) — > qij (0 < g^- < oo) Vi, j, ensure that estimate ft from 
(2) with penalty (3) is consistent, that is, lim nH>00 P(||/3 — /3*|| 2 > e) = for 
all e > 0. 

Proof. If ft minimizes Q p (ft) from (1), then it also minimizes Q p (ft)/n. 
The ordinary least squares estimator minimizes Q(ft) = {y — Xft) T {y — 

Xft), resp. Q(ft)/n. Since Q p {ft)/n -> p Q(ft (LS ^/n and Q p (ft)/n -> p Q(ft)/n, 
we have Q(ft)/n — > p Q(ft^ LS ^)/n. Since ft^ LS ^ is the unique minimizer of 
Q(ft)/n, and Q(ft)/n is convex, we have ft —¥ p ft( LS \ and consistency follows 
from consistency of the ordinary least squares estimator ft^ LS \ which is 
ensured by condition n^/n — > Ci, with < q < 1 Vi. □ 

Asymptotic properties for the ordered case. Let now C = {i > : ft* ^ 

ft*_i} denote the set of indices corresponding to differences of neighboring 
(true) dummy coefficients ft\ which are truly nonzero, and again, C n denote 
the set corresponding to those difference which are estimated to be nonzero, 
based on estimate ft from (2) with penalty (4). The vector of first differences 



SPARSE MODELING OF CATEGORIAL EXPLANATORY VARIABLES 



31 



fii = Pi — Pi-ii i = 1) • • ■ j k, is now denoted as 5 = (<5i, . . . , dk) T ■ In analogy to 
the unordered case, 5 C denotes the true vector of (first) differences included 

~(LS) 

in C, and 5c the corresponding estimate. With p- denoting the ordinary 
least squares estimate of the following proposition holds. 

Proposition 3. Suppose A = A n with \ n jyfn — > and \ n — > oo, and a// 
class-wise sample sizes ni satisfy rii/n — > Ci, where < c, < 1. Then weights 

Wi = 4>i{n)\fi\ LS ^ — ft^f toi</i 0j(n) — s- ft (0 < qi < oo) Vi, ensure that: 

(a) v^-^-^JVCO.E), 

(b) lim„^ 00 P(C n = C) = l. 

Remarks. The proof is a direct application of Theorem 2 in Zou (2006), 
as sketched below. As before in the unordered case, if A is fixed and wi = 
4>i(n), with (fti(n) — > q^ (0 < qi < oo) Vi, j, simple consistency lim n _ i , 00 P(||/3 — 
/3*|| 2 > e) = for all e > is reached. The proof is completely analogue to 
the proof of Proposition 2 before. 

Proof of Proposition 3. In Section 2.3 it has been shown that 
the proposed estimate given an ordinal class structure is equivalent to a 
Lasso type estimate, if ordinal predictors are split-coded. That means since 
4>i(n) — >■ qi (0 < qi < oo) Vz by assumption, and employing Slutsky's theorem 
(the proof of), Theorem 2 about the adaptive Lasso by Zou (2006) can be 
directly applied. Condition nj/n — > Ci, with < c, < 1 Vi, ensures that the 
ordinary least squares estimate is -y/n-consistent. □ 



Precision of the approximate solution. 



Proposition 4. // restriction Oij = 9io — 0jo is represented by AO = 0, 
define J>X = argmin e {(y - Z0) T (y - Z0) + ~i{A0) T A0 + A|0|}, where 9 = 
(6io, ■ ■ ■ ,0k,k-i) T an d |0| = Yli>j Then with 7 > and A > 0, A 7iA = 
(A0~ U \) T AO^^x is bounded above by 

A(I^I-IVI) 



A 7iA < 



1 



where 0^- LS ^> denotes the least squares estimate (i.e., A = 0) where A0( LS ^ = 
holds. 

Proof. Obviously, for all 7 > and A > 0, 

(y - Z0 1)X ) T {y - Z0 1)X ) + A|0 7>A | + 7 A 7)A 

< (y - z0( LS Y(y - zoW) + x\0^\ . 
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Since also 

(y - Z§ ,xf(y - Z9 ,x) + A|0 o ,a| <{y- Ze i>x f(y - zL hX ) + A|0 7)A |, 

and 

(y - Z9 , x f(y - Z V) > (y - Z9^f \y - Z§^), 

we have 

7 A 7iA <A(|^ ls )|-|0 o ,a|). □ 
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